Clustering is a core machine learning task that divides data points into groups with similar
traits. One of the most widely used methods is the k-means clustering algorithm, owing to its
simplicity and effectiveness. However, this algorithm requires the number and coordinates of the
initial clustering centers to be specified in advance. In this paper, a method based on the first
artificial bee colony algorithm with variable-length individuals is proposed to overcome this
limitation of the k-means algorithm. The proposed technique automatically predicts the number of
clusters (the value of k) and determines the most suitable coordinates for the initial clustering
centers instead of presetting them manually. The proposed algorithm outperforms the traditional
k-means algorithm on all three tested real-life clustering datasets.

1. INTRODUCTION

It is important to utilize various data mining techniques, including cluster analysis, to identify,
analyze, and categorize data attributes. Research in data clustering remains active, and clustering is used
extensively in various fields, such as medical sciences, image analysis, machine learning, web cluster engines,
classification, knowledge discovery, and software engineering. Clustering splits a population into groups so
that data points in the same group are more similar to each other than to data points in other groups. It is one
of the most frequently used strategies for unsupervised classification. A variety of unsupervised
clustering algorithms exist, including k-means [1], cobweb [2], farthest-first [3],
expectation-maximization (EM) [4], density-based [5], and hierarchical clustering [6]. Overall, the most
widely used is the k-means algorithm.

K-means clustering is a well-known partitioning algorithm. It is widely used in scientific research
and industrial applications due to its simplicity, rapid convergence, and suitability for processing massive
data sets, among other reasons. The traditional k-means clustering method chooses random starting points
when initializing the clustering centers and typically finds only a locally optimal clustering result. The
resulting lack of stability affects categorization accuracy, so a globally optimizing method is required to
overcome the limitations of this algorithm.

Numerous studies have been conducted to address this issue. For instance, Maulik and
Bandyopadhyay [7] suggested using a genetic algorithm (GA) to search the feature space for cluster centers
that optimize a similarity measure. The particle swarm optimization (PSO) algorithm was used to locate the
centroids of a user-specified number of clusters [8]-[10]. Kao et al. [11] proposed a hybrid method
combining k-means, Nelder-Mead simplex search, and PSO (K-NM-PSO). Niknam and Amiri [12]
suggested a hybrid evolutionary algorithm, fuzzy adaptive particle swarm optimization-ant colony
optimization-k-means (FAPSO-ACO-K), which improves the chances of reaching a good clustering. Laszlo and
Mukherjee [13] proposed a GA approach for seeding the k-means clustering method with
centers using a unique crossover operator that swaps adjacent centers. Nguyen and Cios [14] developed a
clustering method called genetic algorithm k-means logarithmic regression expectation maximization
(GAKREM), a hybrid of a genetic algorithm, k-means, logarithmic regression, and expectation-maximization
that can perform clustering without requiring a predetermined number of groups. Armano and
Farmani [15] suggested a combination of the k-means and artificial bee colony algorithms
(kABC) that uses the artificial bee colony (ABC) algorithm to enhance the capacity of k-means to identify
globally optimal clusters in nonlinear partitional clustering situations. Karaboga and Ozturk [16] applied the
ABC algorithm to data clustering on classification benchmark problems. Zhang et al. [17] presented how to
divide N items into k clusters using the ABC algorithm. Zou et al. [18] introduced the cooperative ABC (CABC),
an extended ABC method that outperforms the original ABC in addressing complex optimization problems, and
used it to solve clustering problems. Bonab et al. [19] combined the k-means algorithm with the ABC
algorithm and the differential evolution algorithm to obtain the optimal solution for objects in datasets and
images. Wang and Wang [20] employed a hybrid of the k-means and ABC algorithms to perform the
clustering procedure. The present work is a step forward in this respect. We address the defects of the
k-means algorithm using the proposed first ABC algorithm with a variable-length food source to predict the
optimal number of clusters (the value of k) and the initial centroids for the k-means method.

In the remaining portion of this paper, the k-means algorithm and ABC algorithm are briefly 
reviewed in section 2. Section 3 describes the proposed method. Results are explained in section 4. Section 5 
discusses the current findings. Section 6 presents the conclusion. 


2. BACKGROUND 
2.1. K-means clustering algorithm 

The k-means clustering algorithm is a simple, iterative, numerical, unsupervised, and non-deterministic
technique [21]. It is commonly used in data mining for grouping large datasets. It is a partitioning
clustering algorithm that divides the supplied dataset into k distinct clusters through an iterative procedure
that converges to a local minimum.

The k-means algorithm starts with k initial cluster centers, where k is a user-specified parameter, chosen
randomly from the dataset. In each iteration, each point in the dataset is assigned to the closest cluster
center. After all data points have been assigned to clusters, the centroid of each cluster is recalculated as
the mean of all points in that cluster. The procedure is repeated until the centroids no longer change.
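
The assignment and update steps described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the experiments: the function name `kmeans`, the `max_iter` cap, the fixed `seed`, and the rule of keeping the old center when a cluster becomes empty are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples of floats."""
    rng = random.Random(seed)
    # Initial centers: k points drawn at random from the dataset.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: attach each point to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        if new_centers == centers:  # centroids unchanged: stop
            break
        centers = new_centers
    return centers, clusters
```

Because the starting centers are random, different seeds can lead to different local optima on harder datasets, which is the instability the proposed method targets.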


2.2. Artificial bee colony algorithm 

The ABC algorithm is based on the behavior of real honey bees [22]. It comprises three groups
of bees: employed bees, onlooker bees, and scout bees. The first half of the colony consists of employed
artificial bees, and the second half consists of onlookers. Only one employed bee is assigned to each food
source; in other words, the number of employed bees equals the number of food sources. When a food source
has been exhausted, its employed bee takes on the role of a scout.

The ABC algorithm is divided into four stages: initialization, employed bees, onlooker bees, and
scout bees, as presented in the following lines.
— Initialization: SN solutions are placed at random in the search space using (1):

$$X_{ij} = Lb_j + \mathrm{rand}(0,1) \cdot (Ub_j - Lb_j) \tag{1}$$

where $i = 1, \dots, SN$, $X_i = \{X_{i1}, X_{i2}, \dots, X_{iD}\}$, SN is the number of solutions in the
population, D is the number of optimization parameters that need to be determined, and the lower and upper
limits of the solution in dimension j are $Lb_j$ and $Ub_j$, respectively.

— Employed bees phase: employed bees and candidate solutions are in one-to-one correspondence.
Each employed bee visits its candidate solution and changes one dimension of the visited solution
to produce a new solution using (2):

$$V_{ij} = X_{ij} + \varphi_{ij} \cdot (X_{ij} - X_{kj}) \tag{2}$$

where j is a dimension chosen uniformly at random from $\{1, 2, \dots, D\}$, $\varphi_{ij}$ is a random number
in [-1, 1], and $X_k$ is a solution chosen at random from the population with the condition $i \neq k$. As in
the way a bee replaces a position in its memory, when $V_i$ is better than the previous
position, the new solution replaces the old one; otherwise, the previous solution's location is retained.
Once all employed bees have completed their search, they perform a waggle dance to alert onlooker bees to
their food sources.

— Onlooker bees phase: the fitness, $\mathit{fitness}_i$, of a solution $X_i$ is assessed using (3):

$$\mathit{fitness}_i = \begin{cases} \dfrac{1}{1 + f(X_i)}, & f(X_i) \ge 0 \\ 1 + |f(X_i)|, & f(X_i) < 0 \end{cases} \tag{3}$$

where $f(X_i)$ is the objective value of $X_i$. Each onlooker bee visits a solution based on the selection
probability of a candidate solution $X_i$ using (4):

$$P_i = \frac{\mathit{fitness}_i}{\sum_{n=1}^{SN} \mathit{fitness}_n} \tag{4}$$

When an onlooker bee chooses a solution, it searches around it for a new, better one. As a result, the
employed bees step controls the diversification behavior of the ABC algorithm, whereas the onlooker bees step
controls its intensification behavior.
— Scout bees phase: every food source that does not improve within a specific 'limit' of trials
is abandoned; its employed bee becomes a scout and replaces it with a new location generated using (1).
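
The four stages can be put together in a short, self-contained sketch of the basic (fixed-length) ABC algorithm for minimizing a function over a box. It follows (1) to (4) as stated above; the function names, default parameter values, the clamping of candidates to the bounds, and the choice to let every exhausted source be replaced in the same cycle are assumptions made for illustration, not details taken from [22].

```python
import random


def fitness(value):
    """Equation (3): map an objective value to a fitness to be maximized."""
    return 1.0 / (1.0 + value) if value >= 0 else 1.0 + abs(value)


def abc_minimize(f, lb, ub, sn=10, limit=20, cycles=200, seed=0):
    """Basic ABC; returns (best_solution, best_objective_value)."""
    rng = random.Random(seed)
    dim = len(lb)

    def random_solution():
        # Equation (1): uniform draw inside the box [lb, ub].
        return [lb[j] + rng.random() * (ub[j] - lb[j]) for j in range(dim)]

    foods = [random_solution() for _ in range(sn)]
    values = [f(x) for x in foods]
    trials = [0] * sn

    def try_neighbor(i):
        # Equation (2): perturb one random dimension j using a partner k != i.
        j = rng.randrange(dim)
        k = rng.choice([n for n in range(sn) if n != i])
        phi = rng.uniform(-1.0, 1.0)
        candidate = foods[i][:]
        candidate[j] += phi * (foods[i][j] - foods[k][j])
        candidate[j] = min(max(candidate[j], lb[j]), ub[j])  # clamp (assumption)
        value = f(candidate)
        # Greedy selection: keep the candidate only if it is better.
        if value < values[i]:
            foods[i], values[i], trials[i] = candidate, value, 0
        else:
            trials[i] += 1

    best_value, best = min(zip(values, foods))
    best = best[:]
    for _ in range(cycles):
        for i in range(sn):  # employed bees phase
            try_neighbor(i)
        # Onlooker bees phase: equation (4), fitness-proportional selection.
        weights = [fitness(v) for v in values]
        for i in rng.choices(range(sn), weights=weights, k=sn):
            try_neighbor(i)
        # Scout bees phase: replace sources that exceeded the trial limit.
        for i in range(sn):
            if trials[i] > limit:
                foods[i] = random_solution()
                values[i] = f(foods[i])
                trials[i] = 0
        # Memorize the best solution achieved so far.
        for i in range(sn):
            if values[i] < best_value:
                best_value, best = values[i], foods[i][:]
    return best, best_value
```

For example, `abc_minimize(lambda x: sum(v * v for v in x), [-5.0, -5.0], [5.0, 5.0])` searches for the minimum of the two-dimensional sphere function.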


3. THE PROPOSED METHOD 

To overcome the previously mentioned limitations of the k-means algorithm, this section presents
the first mechanism that enhances the basic ABC algorithm to handle a variable-length representation
(ABCVL). Individuals of different lengths, i.e., different numbers of features, may focus on different areas of
the search space. Based on this ability, we first develop a new initialization technique that generates
variable-length individuals to provide a suitable level of diversity for the whole population. In addition, a
special search equation is designed to deal with variable-length individuals. The ABCVL algorithm is used to
select the best number of clusters (k) and the initial centroids for the k-means algorithm, as discussed below.


3.1. Initial population 

The length of a food source may vary from one source to another. We follow the
principle of variable-length chromosomes [23], [24] to represent individuals in the population. In this phase,
the population of SN individuals is randomly initialized, each with a different number of optimization
parameters (k). Each individual, ind, in the population is represented as follows:

$$ind_j = \{s_1, s_2, s_3, \dots, s_{k_j}\}, \quad 1 \le j \le SN$$

where SN is the number of food sources (solutions) in the population, each s is selected randomly
from the dataset, and $k_j$ indicates that individual j has k clustering centers, a number that varies from one
individual to another between the minimum, $nC_{min}$, and maximum, $nC_{max}$, number of centers, where

$$nC_{min} \le k_j < nC_{max}$$
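
A minimal sketch of this initialization is given below. The function name `init_population` and its parameter names are hypothetical, and drawing the centers of one individual without replacement is an assumption; the text only states that each center is selected randomly from the dataset.

```python
import random


def init_population(data, sn, nc_min, nc_max, seed=0):
    """Create sn variable-length individuals, each a list of k centers.

    data is a list of points (tuples of floats); it must contain at least
    nc_max - 1 points because centers are drawn without replacement here.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(sn):
        # Number of centers for this individual: nc_min <= k < nc_max.
        k = rng.randrange(nc_min, nc_max)
        # Each center is a data point picked at random from the dataset.
        population.append(rng.sample(data, k))
    return population
```

With the settings reported later in the paper (SN = 20, a minimum of 2 and a maximum of 15), each individual would hold between 2 and 14 centers under this reading of the bounds.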


3.2. Objective function 
To measure the overall k-means clustering quality of food sources, we used different objective
functions (the nectar) based on the most important metrics for evaluating the goodness of a clustering: the
Davies-Bouldin index (DBI) measure [25], the Silhouette coefficient, the homogeneity measure, the completeness
measure, the V-measure, and inertia. In addition, hybrid metrics are applied: the DBI-homogeneity-completeness
measure, the DBI-Silhouette measure, and the DBI-V-measure. Each of these metrics is explained below.
— The DBI measure is calculated by averaging, over all clusters, the similarity between each cluster and the
cluster most similar to it. The smaller the average similarity, the more distinct the groups are and the
more accurate the clustering result.
— The Silhouette coefficient, also known as the Silhouette score, is used to assess the quality of a clustering.
The result of this coefficient is in the range [-1,1], where: i) 1 means that clusters are well-defined and
distinct; ii) 0 means that clusters are indifferent to one another or that the distance between them is
insignificant; and iii) -1 means that samples were assigned to the wrong clusters.

— The homogeneity measure indicates the extent to which each cluster contains only samples from a single
class. The fewer distinct classes a cluster contains, the better. This measure ranges from 0.0 to 1.0,
where higher is better.

— Completeness measure: a perfectly complete clustering is one where all data points belonging to the same
class are assigned to the same cluster. Completeness and homogeneity are symmetrical.

— The V-measure is equal to the harmonic mean of completeness and homogeneity.

— Inertia is a simple criterion derived from the sum of squared errors, which is used as the clustering
criterion function. It is the sum of squared distances between all samples and their clustering centers.

— Hybrid scale 1: In this scale, the DBI measure is combined with homogeneity and completeness measures
using (5).

$$\text{Hybrid scale}_1 = 0.5 \cdot DBI_{score} + 0.25 \cdot homogeneity_{score} + 0.25 \cdot completeness_{score} \tag{5}$$

— Hybrid scale 2: In this scale, the DBI measure is combined with the Silhouette coefficient using (6).

$$\text{Hybrid scale}_2 = 0.6 \cdot DBI_{score} + 0.4 \cdot (1 - Silhouette_{score}) \tag{6}$$

— Hybrid scale 3: In this scale, the DBI measure is combined with the V-measure using (7).

$$\text{Hybrid scale}_3 = 0.6 \cdot DBI_{score} + 0.4 \cdot (1 - V_{measure}) \tag{7}$$
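
As an illustration of how these objectives can be evaluated, the sketch below computes the DBI from its standard definition (mean over clusters of the largest ratio of summed within-cluster scatter to between-centroid distance, with Euclidean distances) and then applies the weights of (5) to (7). The function names are hypothetical, and the Silhouette, homogeneity, completeness, and V-measure scores are assumed to be computed elsewhere and passed in as numbers.

```python
import math


def davies_bouldin(clusters):
    """DBI for a list of non-empty clusters, each a list of points (tuples).

    Assumes at least two clusters with distinct centroids.
    """
    centroids = [tuple(sum(col) / len(c) for col in zip(*c)) for c in clusters]
    # scatter[i]: mean distance of the members of cluster i to its centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Similarity of cluster i to the cluster most similar to it.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k


def hybrid_scale1(dbi, homogeneity, completeness):
    """Equation (5)."""
    return 0.5 * dbi + 0.25 * homogeneity + 0.25 * completeness


def hybrid_scale2(dbi, silhouette):
    """Equation (6)."""
    return 0.6 * dbi + 0.4 * (1.0 - silhouette)


def hybrid_scale3(dbi, v_measure):
    """Equation (7)."""
    return 0.6 * dbi + 0.4 * (1.0 - v_measure)
```

For two clusters {(0,0), (0,2)} and {(10,0), (10,2)}, each cluster has scatter 1 and the centroids are 10 apart, so the DBI is (1 + 1) / 10 = 0.2.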


3.3. Modified search equation 

The basic ABC algorithm was improved to process variable-length solutions. We modified the
search equation, yielding the modified search equation (MSEq), in several directions, as discussed below.
Figure 1 shows the possible cases for producing a new individual $V_i$ from two individuals $X_i$ and $X_k$
with lengths $L_i$ and $L_k$, respectively.


3.3.1. MSEq1

In the first case, MSEq1, the modification position j is chosen within the range of the smaller
individual length, which is shared by the two individuals, and (2) is applied. The length of the new
individual is the same as the length of the first individual. Figure 1(a) shows the schematic diagram of an
MSEq1 example, assuming that the lengths of the selected food sources (solutions) are 7 and 5, respectively.
The position j must therefore be randomly chosen within the range [1,5]. In Figure 1(a), for instance,
the value of j is selected as 3.


3.3.2. MSEq2

In the second case, MSEq2, if $L_i$ is less than $L_k$ and the modification position j is chosen within the
range of $L_k$ but outside $L_i$, then the values of $X_i$ are copied to $V_i$ and the value of $X_{k,j}$ is
saved to $V_{i,L_i+1}$. Therefore, the length of the new individual is $L_i+1$. Figure 1(b) shows the schematic
diagram of an MSEq2 example, assuming that the lengths of the selected food sources (solutions) are 3 and 5,
respectively. The position j is therefore randomly chosen within the range [1,5]. If j falls outside 3, here
j = 4, then the value of $X_{k,4}$ is saved to $V_{i,4}$.


3.3.3. MSEq3

In the third case, MSEq3, if $L_i$ is greater than $L_k$ and the modification position j is chosen outside
$L_k$, (2) is applied to the value of $X_{i,j}$ with a random position in $X_k$. The remaining positions of
$X_i$ are discarded. Therefore, the length of the new individual is j.

Figure 1(c) shows the schematic diagram of an MSEq3 example, assuming that the lengths of the
selected food sources (solutions) are 5 and 3, respectively. The position j is therefore randomly chosen
within the range [1,5]. If j falls outside 3, here j = 4, then (2) is applied between $X_{i,4}$ and a random
position in $X_k$, here 2, to produce $V_{i,4}$. The length of the new individual is 4. As can be seen above,
MSEq1 differs from MSEq2 and MSEq3 in the length of the new solution: in MSEq2 and MSEq3 it may differ from
that of the original solutions, whereas in MSEq1 it remains the same as that of the first individual. The
pseudocode of the ABCVL algorithm is given in Algorithm 1.

Figure 1. The schematic diagram of (a) MSEq1, (b) MSEq2, and (c) MSEq3


Algorithm 1. The pseudocode of the ABCVL algorithm 

$L_i$ is the length of $X_i$
$L_k$ is the length of $X_k$

— Initialization phase using variable-length individuals (see Section 3.1);
— Repeat
    — Employed bees phase using the MSEq technique;
    — Onlooker bees phase using the MSEq technique;
    — Scout bees phase;
    — Memorize the best solution achieved so far;
— Until (termination conditions are met).
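
The three MSEq cases used in the employed and onlooker phases can be sketched as a single function. This is one reading of Section 3.3 rather than the authors' code: it assumes that j is drawn uniformly over the longer of the two individuals, that each position holds a whole cluster center (a tuple of floats), and that (2) is applied to a center element-wise with a single random coefficient. The names `combine` and `mseq` are hypothetical, and positions are 0-based.

```python
import random


def combine(a, b, phi):
    """Equation (2) applied element-wise to two centers (tuples of floats)."""
    return tuple(x + phi * (x - y) for x, y in zip(a, b))


def mseq(x_i, x_k, rng):
    """Produce a new individual V_i from X_i and X_k (lists of centers)."""
    len_i, len_k = len(x_i), len(x_k)
    # Assumption: j ranges over the longer individual (0-based position).
    j = rng.randrange(max(len_i, len_k))
    phi = rng.uniform(-1.0, 1.0)
    if j < min(len_i, len_k):
        # MSEq1: j is shared by both individuals; the length stays L_i.
        v = list(x_i)
        v[j] = combine(x_i[j], x_k[j], phi)
    elif len_i < len_k:
        # MSEq2: j lies outside X_i; copy X_i and append X_k[j],
        # so the new length is L_i + 1.
        v = list(x_i) + [x_k[j]]
    else:
        # MSEq3: j lies outside X_k; pair X_i[j] with a random center of X_k
        # and drop the positions after j (new length j + 1 when 0-based).
        partner = x_k[rng.randrange(len_k)]
        v = list(x_i[:j]) + [combine(x_i[j], partner, phi)]
    return v
```

A call such as `mseq(x_i, x_k, random.Random(0))` with individuals of lengths 3 and 5 returns a new individual of length 3 or 4, matching the MSEq1 and MSEq2 cases.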


4. RESULTS 
4.1. Tested datasets 

To evaluate the efficiency of the ABCVL algorithm, three real-life datasets are used:
i) the Mall_Customers dataset [26] consists of 200 samples with five features:
Customer_ID (unique ID for each customer), gender (male or female), age,
annual income (the customer's income per year, in dollars), and spending score (from 1 to 100).
This dataset is tested on two features (annual income and spending score). The optimum
number of clusters is 5; ii) the digits dataset [27] consists of 1797 8x8 grayscale images of handwritten
digits. Each feature is an integer in the range 0...16. The optimum number of classes is 10 (digits 0...9); and
iii) the breast cancer dataset [28] consists of 569 samples with real, positive features and a dimensionality
of 30. It is divided into two classes: 212 malignant and 357 benign samples. The optimum number of
classes is 2.


4.2. Parameters setting 

In the artificial bee colony with variable-length (ABCVL) algorithm, the colony size (NP) was 40,
with the number of food sources (SN) equal to NP/2. The minimum ($nC_{min}$) and maximum ($nC_{max}$) numbers
of optimization parameters (D) were 2 and 15, respectively. The abandonment limit counter "limit"
was set to SN*D. The maximum number of cycles was 100.
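
These settings can be collected in a small configuration object, sketched below. The class and field names are hypothetical, and treating D as the length of the individual being evaluated when computing the limit is an assumption, since D varies between individuals in ABCVL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ABCVLSettings:
    colony_size: int = 40  # NP
    nc_min: int = 2        # minimum number of centers
    nc_max: int = 15       # maximum number of centers
    max_cycles: int = 100  # maximum number of cycles

    @property
    def food_sources(self) -> int:
        """SN = NP / 2."""
        return self.colony_size // 2

    def limit(self, d: int) -> int:
        """Abandonment limit SN * D for an individual with d parameters."""
        return self.food_sources * d
```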


4.3. Results 
We tested the ABCVL algorithm in two experiments. The first experiment,
Experiment I, was performed to evaluate the effect of the objective function on the performance of the
ABCVL algorithm. The second experiment, Experiment II, was performed to evaluate the
effectiveness of variable-length individuals.


4.3.1. Experiment I

To select a suitable metric as an objective function, we tested the nine metrics described in Section 3.2
by evaluating the ABCVL algorithm on clustering problems for three datasets. We executed the ABCVL
algorithm ten times with each metric and chose the best one
among them. Table 1 shows the results of this experiment in terms of the best value, mean, optimal k, and the
mean of the ten values of k, where the best results are highlighted in bold.


Table 1. The results of Experiment I 


| Method | Mall_Customers: Best | Mean | Optimal k | Mean (k) | Digits: Best | Mean | Optimal k | Mean (k) | Breast_cancer: Best | Mean | Optimal k | Mean (k) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DBI | 0.566707 | 0.570686 | 5 | 5.8 | 1.690358 | 1.704200 | 10 | 10.3 | 0.504404 | 0.506289 | 2 | 2.2 |
| Silhouette | 0.446068 | 0.446068 | 5 | 5.0 | 0.809094 | 0.809863 | 12 | 11.9 | 0.302735 | 0.302735 | 2 | 2.0 |
| Homogeneity | - | - | - | - | 0.154497 | 0.1611149 | 14 | 13.8 | 0.334320 | 0.336008 | 9 | 8.8 |
| Completeness | - | - | - | - | 0.152885 | 0.158477 | 2 | 2.8 | 0.483191 | 0.483191 | 2 | 2.0 |
| V_measure | - | - | - | - | 0.193189 | 0.201859 | 12 | 12.1 | 0.529086 | 0.529086 | 3 | 3.0 |
| Inertia | 12738.31 | 13067.55 | 14 | 14.0 | 1043333.01 | 1052333 | 14 | 13.8 | 9432896.69 | 9601898 | 9 | 9.0 |
| Hybrid-scale1 | - | - | - | - | 0.975439 | 0.986869 | 10 | 11.5 | 0.517427 | 0.517427 | 2 | 2.0 |
| Hybrid-scale2 | 0.521398 | 0.521398 | 5 | 5.0 | 1.337068 | 1.355614 | 10 | 10.0 | 0.423736 | 0.423736 | 2 | 2.0 |
| Hybrid-scale3 | - | - | - | - | 1.164177 | 1.176221 | 10 | 11.0 | 1.120295 | 1.132735 | 2 | 2.0 |


4.3.2. Experiment II


To test the performance of the ABCVL algorithm on clustering problems, we compared its results
with those of the basic k-means algorithm on three clustering datasets. The results of this experiment, in
terms of the best value, mean, and standard deviation produced by each algorithm over 30 runs, are shown in
Table 2, where the best results are highlighted in bold. In addition, the evolution curve of the hybrid-scale2
function for the ABCVL algorithm is shown in Figure 2(a) to (c) for the three tested datasets. The coordinates
of the cluster centers produced are given in Tables 3-5 for the three tested datasets.


Table 2. The results of Experiment II 


| Dataset | Basic k-means: Best | Mean | SD | ABCVL: Best | Mean | SD |
|---|---|---|---|---|---|---|
| Mall_Customers | 0.5213980 | 0.5895242 | 0.1056726 | 0.52139798 | 0.5209177 | 0.002586571 |
| Digits | 1.382326 | 1.503292 | 0.06446432 | 1.336862 | 1.345410 | 0.008708451 |
| Breast_cancer | 0.4237363 | 0.4237363 | 1.110x10^-16 | 0.42373629 | 0.4237363 | 1.110x10^-16 |


Table 3. Coordinates for cluster centers obtained by the ABCVL algorithm for the Mall_Customers dataset 


| Feature (Dimensionality=2) | Center 1 | Center 2 | Center 3 | Center 4 | Center 5 |
|---|---|---|---|---|---|
| Annual Income | 73.0 | 54.0 | 30.0 | 99.80159093758257 | 54.436110937843466 |
| Spending Score | 88.0 | 48.0 | 73.0 | 81.04616983940875 | 99.0 |


Table 4. Coordinates for cluster centers obtained by the ABCVL algorithm for digits dataset 


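| Feature (Dimensionality=64) | Center 1 | Center 2 | Center 3 | Center 4 | Center 5 | Center 6 | Center 7 | Center 8 | Center 9 | Center 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3.01693 | 0 | 0.22672 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 8.35953 | 0 | 3.3012 | 13.1017 | 1 | 6 | 0 | 0 | 12 | 0 |
| 3 | 16.000 | 13.215 | 14.478 | 14.369 | 8.984 | 15 | 8 | 8 | 16 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60 | 12.508 | 14.633 | 16.000 | 0.000 | 9.853 | 5 | 7 | 14 | 0 | 16 |
| 61 | 16.000 | 16.000 | 7.505 | 0 | 3.303 | 0 | 0 | 16 | 0 | 6 |
| 62 | 1.031 | 11.529 | 1.6 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| 63 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


Table 5. Coordinates for cluster centers obtained by the ABCVL algorithm for Breast_cancer dataset


| Feature (Dimensionality=30) | Center 1 | Center 2 |
|---|---|---|
| 1 | 14.95 | 17.20 |
| 2 | 18.77 | 24.52 |
| 3 | 97.84 | 114.20 |
| 4 | 689.5 | 929.4 |
| ... | ... | ... |
| 27 | 0.2500 | 0.6566 |
| 28 | 0.08405 | 0.18990 |
| 29 | 0.2852 | 0.3313 |
| 30 | 0.09218 | 0.13390 |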


5. DISCUSSION 

The results in Table 1 indicate that the best metric for evaluating the clustering process is
hybrid-scale2, i.e., the combination of the DBI measure with the Silhouette coefficient. This metric gives the
optimal value of k for all datasets and all experiment runs. Therefore, this metric is used in the current
system to evaluate the performance of the clustering method. The results show that the proposed method
performs better than setting the parameters by trial and error. In addition, as Table 2 shows, the proposed
method produces better results than the basic k-means algorithm in terms of the best value, mean,
standard deviation, and speed of finding the best value for the same tested datasets. As shown in Figure 2, the
ABCVL algorithm converges faster for the Mall_Customers dataset in Figure 2(a) than for the
Breast_cancer dataset in Figure 2(c), which in turn converges faster than the digits dataset in Figure 2(b).

Figure 2. The curve of the hybrid scale2 evolution for three tested datasets (a) Mall_Customers dataset,
(b) digits dataset, and (c) Breast_cancer dataset


6. CONCLUSION 

Choosing a suitable combination of parameters for the k-means algorithm, such as the number of
clusters k and the initial centroids, is a challenge that becomes more complicated when dealing with complex
problems. Selecting reasonable parameters depends on the nature of the problem and the application. One
potential solution is to use trial and error guided by expert judgment, but this may not be
performed efficiently in areas where the level of expertise is low. Moreover, even when expertise is
available, specifying a fair value for each parameter can be a difficult and time-consuming task. Another
solution for determining the optimal number of clusters is to use semi-automatic approaches such as the elbow
method, silhouette analysis, and gap statistics. Unfortunately, these approaches are subjective. An alternative
solution is to assign each parameter automatically. To achieve this goal, we proposed the first ABC algorithm
with variable-length individuals, ABCVL; to our knowledge, no such attempt has been made so far for the ABC
algorithm. In the ABCVL algorithm, we first develop a new initialization technique that generates
variable-length food sources to provide a suitable level of diversity for the whole population. Then, a special
search equation is designed to deal with variable-length individuals. Overall, the current work achieved more
encouraging findings than the traditional k-means algorithm on three real-life clustering datasets. These
findings seem to be consistent with those of other studies and suggest that further work in this direction
would be very worthwhile. In future work, we will take the order of the data into account, since it affects
the final results.